Take-home_Ex02
Edge data should be organised as such: (can use dplyr methods)
First column: Source id (FK to Node second column) - compulsory
Second column: Target id (FK to Node second column) - compulsory
Node data
First column: ID (contains all the distinct values of source and target in Edge data) - compulsory
- Nodes present in edge data must exists in ID of node data, must not have missing in node ID.
Second column: Label (only need if Id are all integers)
Try not to use R built-in NA/NULL function. Manually type “unknown’ / ‘missing’ as a value instead.
Vast Challenge 2023 Mini Challenge 2 (Subtask: 1)
In this challenge, my group and I seek to use visual analytics to identify temporal patterns for individual entities and between entities using the knowledge graph the FishEye Organisation has provided us with. We will also be categorising the type of business relationship patterns found. The visualisation outputs:
1) Interactive network graph with nodes coloured by varying range of betweenness, in-deg and out-deg centrality scores to spot the companies with high / low centrality scores easily.
2) Heatmap for top n companies (shipping and receiving) over the years 2028 to 2034 allows us to see all trading activities at one glance. We could try to spot companies with short trading duration.
3) Using two top receiving companies as an example, iteractive time-series charts (line chart) are plotted with four of their supplier companies to spot trends and anomalies. We could try to spot receiving companies who keeps changing suppliers with this.
4) Coordinated and interactive scatterplots of ‘number of interactions’ and ‘partnership duration’ between each unique pair of ship-receive companies were plotted. We could try to spot outliers and investigate them!
1 About the dataset
1.1 Data dictionary
Node Attributes:
id -- Name of the company that originated (or received) the shipment
shpcountry -- Country the company most often associated with when shipping
rcvcountry -- Country the company most often associated with when receiving
dataset -- Always ‘MC2’
Edge Attributes:
arrivaldate -- Date the shipment arrived at port in YYYY-MM-DD format.
hscode -- Harmonized System code for the shipment. Can be joined with the hscodes table to get additional details.
valueofgoods_omu -- Customs-declared value of the total shipment, in Oceanus
Monetary Units (OMU)
volumeteu -- The volume of the shipment in ‘Twenty-foot equivalent units’, roughly how many 20-foot standard containers would be required. (Actual number of containers may have been different as there are 20ft and 40ft standard containers and tankers that do not use containers)
weightkg -- The weight of the shipment in kilograms (if known)
dataset -- Always ‘MC2’
type -- Always ‘shipment’ for MC2
generated_by -- Name of the program that generated the edge. (Only found on ‘bundle’ records.)
1.2 Importing the datasets
Import libraries
The new libraries used today are :
jsonliteto import json file
Pull out the nodes and edge data and save them as tibble data frames.
Rows: 34,576
Columns: 3
$ id <chr> "AquaDelight Inc and Son's", "BaringoAmerica Marine Ges.m.b…
$ shpcountry <chr> "Polarinda", NA, "Oceanus", NA, "Oceanus", "Kondanovia", NA…
$ rcvcountry <chr> "Oceanus", NA, "Oceanus", NA, "Oceanus", "Utoporiana", NA, …
Rearranging the columns in edge file as we require source and target columns to be the first two columns.
Rows: 5,464,378
Columns: 8
$ source <chr> "AquaDelight Inc and Son's", "AquaDelight Inc and Son…
$ target <chr> "BaringoAmerica Marine Ges.m.b.H.", "BaringoAmerica M…
$ arrivaldate <chr> "2034-02-12", "2034-03-13", "2028-02-07", "2028-02-23…
$ hscode <chr> "630630", "630630", "470710", "470710", "470710", "47…
$ valueofgoods_omu <dbl> 141015, 141015, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ volumeteu <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ weightkg <int> 4780, 6125, 10855, 11250, 11165, 11290, 9000, 19490, …
$ valueofgoodsusd <dbl> NA, NA, NA, NA, NA, NA, 87110, 188140, NA, 221110, 58…
2 Data cleaning
2.1 Lets check for duplicates
For MC2_nodes dataframe:
There are no duplicated nodes, which is great.
For MC2_edges dataframe:
There are about 155291 records (2% out of total records) that are duplicated.
Show the code
[1] 155291
We will drop the duplicates.
2.2 Check for null values
Check whether each column in MC2_nodes and MC2_edges contains null and prints the percentage of null for each column.
For MC2_nodes dataframe:
There are no null values in the id column of Nodes file, which is great.
Show the code
# Check for null values in each column
null_counts_nodes <- sapply(MC2_nodes, function(x) sum(is.null(x) | is.na(x)))
# Calculate the percentage of null values for each column
null_percentages_nodes <- null_counts_nodes / nrow(MC2_nodes) * 100
# Display the results
#print(null_percentages)
knitr::kable(null_percentages_nodes, "simple", col.names = c("Null Percentage"))| Null Percenta | ge |
|---|---|
| id | 0.00000 |
| shpcountry | 64.66624 |
| rcvcountry | 8.41335 |
For MC2_edges dataframe:
As there are a lot zeros inside MC2_edges$volumteu col, we will consider 0 as equivalent to null values.
We can see that the columns valueofgoods_omu and volumeteu are mainly null. valueofgoodusd column contains more than 50% null values. There are 4 records of source with 0 as value, but 0 is their unique identifier so we do not consider 0 as null in source column. It means to say that only source, target, arrivaldate, hscode and weight columns will be helpful in our analysis.
Show the code
# Check for null values in each column
null_counts <- sapply(MC2_edges_no_dup, function(x) sum(is.null(x) | is.na(x) | x==0))
# Calculate the percentage of null values for each column
null_percentages <- null_counts / nrow(MC2_edges_no_dup) * 100
# Display the results
#print(null_percentages)
knitr::kable(null_percentages, "simple", col.names = c('Percentage null'))| Percentage null | |
|---|---|
| source | 0.0000753 |
| target | 0.0000000 |
| arrivaldate | 0.0000000 |
| hscode | 0.0000000 |
| valueofgoods_omu | 99.9947072 |
| volumeteu | 85.2925183 |
| weightkg | 0.9012849 |
| valueofgoodsusd | 54.4461223 |
We will be dropping the valueofgoods_omu , valueofgoodusdand volumeteu columns from our dataframe by selecting only the columns that we need.
2.3 Check on the HScodes
Check the unique number of hscodes in the dataset. There are 4761 unique HScodes, however many are not related to fishing related activties.
[1] 4761
With reference two websites on “Harmonized System of Nomenclature” namely, World Custom Organisation Harmonized System codes and connect2India , we will filter for records that have HScodes starting with 1604xx, 1605xx and 301xxx to 308xxx as they refer to seafood commodities, thus removing many other transactions like ‘television’, ‘steel parts’ etc… This will help to filter away the noises and help us to focus on trading activities related to the fishing industry.
2.4 Preparation of Edges
We will perform a group by ‘source’, ‘target’ and ‘arrivaldate’ and aggregate the total count of interactions, ‘Weight’, between each pair of companies. At this moment, we should not be filtering the records because we would like to calculate the network centrality scores first before zooming into records of interest. As we are interested to see whether there are patterns of self-loops, we will not be removing any company shipping to itself.
Let us wrangle the date columns to prepare dataframe for temporal analysis later.
(1) change the arrivaldate column to date data type
(2) create year, month, weekday, weeknumber columns
TBC: Inspect the frequency of source and target actors, and remove those actors below a frequency count of 5.
TBC: First , we remove low frequency source actors under 5 counts.
Next, remove target actors with frequency count less than 5:
2.5 Preparation of Nodes
We will include only nodes that are in ‘source’ and ‘target’ columns in the mc2_seafood_edges_agg dataframe after the first round of data filtering.
2.6 Creating the network graph dataframe using tbl_graph() of the tidygraph package.
Node file needs to have ID of nodes as first column.
Edge file need to contain source and target as column 1 and 2.
To create the network graph dataframe
The dataframe ‘seafood_graph’ has 11539 nodes with 374709 edges. It is a directed graph with 214 components.
# A tbl_graph: 11539 nodes and 374709 edges
#
# A directed multigraph with 214 components
#
# A tibble: 11,539 × 3
id shpcountry rcvcountry
<chr> <chr> <chr>
1 AquaDelight Inc and Son's Polarinda Oceanus
2 Yu gan Sea spray GmbH Industrial Oceanus Oceanus
3 Olas del Mar Worldwide Oceanus Oceanus
4 French Crab S.p.A. Worldwide Kondanovia Utoporiana
5 Panope Limited Liability Company Vesperanda Oceanus
6 hǎi dǎn Corporation Wharf Marebak Oceanus
# ℹ 11,533 more rows
#
# A tibble: 374,709 × 11
from to arrivaldate weight sum_goods_weightkg hscode year month day
<int> <int> <date> <int> <int> <chr> <dbl> <dbl> <int>
1 1632 5401 2029-12-21 2 42950 307590 2029 12 21
2 1632 5401 2030-01-19 2 42585 307590 2030 1 19
3 1632 5401 2030-06-25 1 21110 307590 2030 6 25
# ℹ 374,706 more rows
# ℹ 2 more variables: weekday <ord>, weeknumber <dbl>
Running the code chunk below to confirm that ‘seafood_graph’ is not a connected graph.
2.7 Calculate the various centrality measures of seafood_graph.
The top 10 nodes with reference to various centrality scores are printed using kable() function from knitr.
Reference was made from this link. The tidyverse centrality functions can be taken from here.
First compute ‘betweenness’, ‘in-deg’ and ‘out-deg’ and ‘pagerank’ scores. All my betweenness scores were zero (Investigation in progress)…
Show the code
seafood_graph<- seafood_graph %>%
activate("nodes") %>%
mutate(betweenness_centrality = centrality_betweenness(directed = TRUE)) %>%
mutate(in_deg_centrality = centrality_degree(weights = weight,
mode = "in")) %>%
mutate(out_deg_centrality = centrality_degree(weights = weight,
mode = "out")) %>%
mutate(pagerank = centrality_pagerank(weights = weight,
directed = TRUE)) #%>%
#mutate(community = as.factor(group_edge_betweenness(weights = Weight,
#directed = TRUE,
#n = 10)))Let us take a look at the top 10 nodes with high ‘betweenness’ centrality scores:
To see the top 20 nodes with ‘out-deg’ scores:
3 Interactive graphs of network structure, temporal analysis and business patterns detection
3.1 Interactive graph for top 3 BETWEENNESS CENTRALITY companies
We will attempt to plot a network involving the top 3 companies including ‘Drakensberg S.A. de C.V. Marine ecology’,‘Playa de Oro BV’ and ‘bái suō wěn lú S.p.A.’ in terms of high betweenness centrality scores. We might be able to see their relationship through the graph.
Getting a list of top 3 companies in terms of betweeness scores.
Filter all records in the edge file involving the top 3 companies. There are 586 records after this step.
The code chunk below first extracts the node dataframe from tbl_graph() object ‘seafood_graph’ created earlier. The reason for doing so is because we want the centrality values inside for tooltip later.
Next, trim this node dataframe by including nodes found only in the ‘from’ and ‘target’ columns in ‘mc2_seafood_edges_agg_bet’ edge file. There are 96 nodes after this step.
Calculate the total goods shipped between each pair of companies (for edge tooltip later)
Rename the first two columns in the edge file to from and to for visNetwork to be able to recognize them. Create a title column for visNetwork to display the tooltip when we hover our mouse over the edges later.
Similarly, in the nodes file we add a column title We will be displaying the ‘in-deg’, ‘betweenness’ and ‘out-deg’ scores in the tooltip. If we want to colour the nodes by their shipping countries, then we would have to rename the shpcountry column to group because visNetwork looks for group column to colour the nodes. However, we will not do this now.
Show the code
# further processing
nodes_seafood_vis_bet <- nodes_seafood_bet %>%
#rename(group= rcvcountry) %>%
mutate(pagerank = round(pagerank, 5)) %>%
mutate(title = paste('shpcountry =', shpcountry, ',',
'rcvcountry =', rcvcountry, ',',
'\n In-deg = ',in_deg_centrality, ',',
"\n Betweenness =", betweenness_centrality, ',',
"\n Out-deg =", out_deg_centrality))As we want the nodes to be colour-coded by the betweenness centrality scores, we have first have to bin the betweenness scores into intervals using the cut() function. Next, rename the bet_grp column to group column for VisNetwork to colour nodes by betweenness intervals.
Show the code
bet_brks <- c(0, 500000, 1000000, 1500000)
grps <- c('500,000 & Below','500,001 -1,000,000', '1,000,000 - 1,254,681')
nodes_seafood_vis_bet$bet_grp <- cut(nodes_seafood_vis_bet$betweenness_centrality, breaks=bet_brks, labels = grps,include.lowest = TRUE)
#nodes_seafood_vis_in$in_deg_grp <- factor(nodes_seafood_vis$in_deg_grp, ordered = TRUE, levels = c('3001-6132','2001-3000','1001-2000','501-1000','500 & Below'))
nodes_seafood_vis_bet <- nodes_seafood_vis_bet %>%
rename(group = bet_grp)The code chunk below plots the interactive network graph.
Show the code
set.seed(1234)
visNetwork(nodes_seafood_vis_bet,
mc2_seafood_edges_agg_vis_bet,
main = "Ego Network of top 3 Betweenness companies",
height = "500px", width = "100%") %>%
visIgraphLayout(layout = "layout_nicely") %>%
visEdges(arrows = 'to',
smooth = list(enables = TRUE,
type= 'straightCross'),
shadow = FALSE,
dash = FALSE) %>%
visOptions(highlightNearest = list(enabled = T, degree = 1, hover = T),
nodesIdSelection = TRUE,
selectedBy = "group") %>%
visInteraction(hideEdgesOnDrag = TRUE) %>%
visLegend() %>%
visLayout(randomSeed = 123)Interactive features of the graph above
(1) Select Id dropdown list
(2) Select Group dropdown list: The values inside refers to the range of ‘in-deg’ centrality scores of the nodes. The pink colour node will represent the highest in-deg centrality score, followed by green, yellow, red and blue.
(3) Zoom in to see the node labels, and arrows direction.
(4) Drag a particular node away from the cluster to admire it.
(5) Hover mouse over a node will display tooltip (shpcountry, rcvcountry, In-deg, pagerank and out-deg score). It will also display the ‘ego’ network with itself at the ego. Click on the node to freeze the ego network. Click on blank space to reset.
(6) Hovering the mouse over an edge will display tooltip (Total weight of cargo, hscode of cargo)
(7) Click and Drag on the graph to move the canvas around, will also temporary disable the edge lines.
The graph above shows the egonetwork of the top 3 betweenness companies and we can even see their tradiing relationship. The nodes in red are the companies with the top 2 betweenness centrality values, the nodes in yellow are ranked lower while the nodes in blue have relatively much lower centrality scores. The table below shows the top 10 companies in terms of centrality values.
Show the code
| id | betweenness_centrality | out_deg_centrality | in_deg_centrality |
|---|---|---|---|
| Drakensberg S.A. de C.V. Marine ecology | 1254680.7 | 18 | 133 |
| Playa de Oro BV | 1231661.8 | 605 | 15 |
| bái suō wěn lú S.p.A. | 839665.4 | 42 | 95 |
| Punjab s Marine conservation | 804016.8 | 181 | 2716 |
| David Ltd. Liability Co Forwading | 640974.3 | 70 | 4473 |
| Yenisei Eel GmbH & Co. KG Services | 599808.9 | 1871 | 500 |
| Costa de Oro S.p.A. | 563721.1 | 113 | 175 |
| Marine Mates NV Worldwide | 560574.1 | 424 | 874 |
| Liumbwe GmbH & Co. KG | 551504.4 | 130 | 3 |
| Turkish Salmon A/S Marine | 470845.1 | 699 | 1132 |
It is observed that the companies with high betweenness scores served as both ‘source’ and ‘target’ companies; meaning they buy and sell seafood related products. Companies like ‘Drakensberg S.A. de C.V. Marine ecology’ focus more on buying than selling, while ‘Playa de Oro BV’ focuses more on selling than buying.
The business relationship observed: ‘Drakensberg S.A. de C.V. Marine ecology’ (ranked 1st) is one of the suppliers to ‘Playa de Oro BV’ (ranked 2nd) who in turns supplies to ‘bái suō wěn lú S.p.A.’ (ranked 3rd) who sells to many other companies. To reveal more relationships, the graph can be extended (to see the egonetwork of more yellow nodes ) by increasing the top n companies in the first code chunk in this section. For example, if we increase the size of the our network graph to ego network of top 5 betweenness companies (instead of 3), we would get

3.2 Interactive graph for top 5 IN_DEG CENTRALITY companies
In this section, we will focus on companies with top in-deg centrality scores.
First, obtain the list of top 5 in deg companies
Filter all records in the edge file involving the top 5 in-deg companies.
Next calculate the sum of all shipments between each company and each of it supplier company.
To further reduce the number of nodes, we will keep the records when the ‘total_shipped_weightkg’ is above 10,000,000 kg. This will retain only trading interactions involving a sizable weight (kg) of goods traded. This is an alternative to filtering using the ‘weight’ column which indicates the number of trading interactions. There are 19k edges after this step.
Show the code
# Subset the dataframe to keep only rows with valid values (top 20 in-deg) in 'target' column
mc2_seafood_edges_agg_in <- mc2_seafood_edges_agg[mc2_seafood_edges_agg$target %in% top_n_in_list, ]
# Calculate the sum of all the shipment weight between each pair of companies in edge file.
mc2_seafood_edges_agg_in <- mc2_seafood_edges_agg_in %>%
group_by(source, target) %>%
mutate(total_shipped_weightkg = sum(sum_goods_weightkg))
# further removal of records where 'weight' column is less than 10
mc2_seafood_edges_agg_in <- mc2_seafood_edges_agg_in %>%
filter(total_shipped_weightkg>=10000000)The code chunk below first extracts the node dataframe from tbl_graph() object ‘seafood_graph’ created earlier. The reason for doing so is because we want the centrality values inside for tooltip later.
Next, trim this node dataframe by including nodes found only in the ‘from’ and ‘target’ columns in ‘mc2_seafood_edges_agg_in’ edge file. There are 62 nodes after this step.
Rename the first two columns of the edge file to from and to for visNetwork to be able to recognise them. Create a title column for visNetwork to display the tooltip when we hover our mouse over the edges later.
Similarly, in the nodes file we add a column title We will be displaying the ‘in-deg’, ‘betweenness’ and ‘out-deg’ scores in the tooltip.
Show the code
# extract nodes file from seafood_graph as a data frame
#nodes_seafood_vis <- as.data.frame(seafood_graph %>% activate(nodes))
# further processing
nodes_seafood_vis_in <- nodes_seafood_in %>%
#rename(group= rcvcountry) %>%
mutate(pagerank = round(pagerank, 5)) %>%
mutate(title = paste('shpcountry =', shpcountry, ',',
'rcvcountry =', rcvcountry, ',',
'\n In-deg = ',in_deg_centrality, ',',
"\n Betweenness =", betweenness_centrality, ',',
"\n Out-deg =", out_deg_centrality))As we want the nodes to be colour-coded by the in-deg centrality scores, we have first have to bin the in-deg scores into intervals using the cut() function. Next, rename the in_deg_grp column to group column for VisNetwork to colour nodes by in-deg score intervals.
Show the code
in_deg_brks <- c(0, 10000, 20000,30000,40000,50000, 60000,70000)
grps <- c('10,000 & Below','10,001-20,000', '20,001-30,000', '30,001-40,000','40,001-50,000', '50,001-60,000','60,001-70,000')
nodes_seafood_vis_in$in_deg_grp <- cut(nodes_seafood_vis_in$in_deg_centrality, breaks=in_deg_brks, labels = grps,include.lowest = TRUE)
#nodes_seafood_vis_in$in_deg_grp <- factor(nodes_seafood_vis$in_deg_grp, ordered = TRUE, levels = c('3001-6132','2001-3000','1001-2000','501-1000','500 & Below'))
nodes_seafood_vis_in <- nodes_seafood_vis_in %>%
rename(group = in_deg_grp)The code chunk below plots in interactive network graph using visNetwork.
Show the code
set.seed(1234)
visNetwork(nodes_seafood_vis_in,
mc2_seafood_edges_agg_vis_in,
main = "Ego Network of top 5 IN-DEG companies",
height = "500px", width = "100%") %>%
visIgraphLayout(layout = "layout_nicely") %>%
visEdges(arrows = 'to',
smooth = list(enables = TRUE,
type= 'straightCross'),
shadow = FALSE,
dash = FALSE) %>%
visOptions(highlightNearest = list(enabled = T, degree = 1, hover = T),
nodesIdSelection = TRUE,
selectedBy = "group") %>%
visInteraction(hideEdgesOnDrag = TRUE) %>%
visLegend() %>%
visLayout(randomSeed = 123)The plot above shows not only the filtered ego network of the top 5 companies (in terms of in-deg scores) but also how these five companies are related to one another. Red node has the highest in-deg centrality scores , followed by blue and yellow. The top 5 in-deg companies are all related and buys from ‘Sea Breezes S.A de C.V Freight’.
The table below shows the names of the top 5 in-deg companies.
Show the code
| id | in_deg_centrality | pagerank | out_deg_centrality |
|---|---|---|---|
| Mar del Este CJSC | 63332 | 0.0278945 | 32 |
| hǎi dǎn Corporation Wharf | 57221 | 0.0355972 | 16 |
| Caracola del Sol Services | 50655 | 0.0167972 | 6 |
| Pao gan SE Seal | 35156 | 0.0124541 | 5 |
| Costa de la Felicidad Shipping | 30701 | 0.0094311 | 6 |
3.2.1 Temporal analysis of top 20 ‘in-deg’ centrality companies in trading occurrence over the years
First, obtain the list of company ids in the top 20 in-deg ranking.
Next, filter the edge file to keep only the records involving the top 20 in-deg companies. There are 185K records after this step.
Use the first code chunk below to compute the sum of monthly weights (interactons) between each unique pair of company in our filtered data. The second code chunk is to compute the year-month in which a source company has the max number of shipping interaction throughout the years.
Show the code
hm <- mc2_seafood_edges_agg_in_hm %>%
mutate(month = floor_date(arrivaldate, unit = "month")) %>%
group_by(target, month) %>%
summarise(total_weight_by_month = sum(weight)) %>%
ungroup()
hm <- hm %>% group_by(target) %>%
mutate(maxweightmonth = as.Date(month[which.max(total_weight_by_month)])) %>%
ungroup()Now lets plot heatmap
Show the code
plotfrom <- "2028-01-02"
plotto <- '2034-01-02'
ggplot(hm, aes(x = month, y = fct_reorder(target, maxweightmonth), fill = total_weight_by_month)) +
geom_tile(colour="White", show.legend=FALSE) +
scale_fill_distiller(palette="Spectral") +
scale_y_discrete(name="", expand=c(0,0))+
scale_x_date(name="Arrival Date",
limits=as.Date(c(plotfrom, plotto)),
expand=c(0,0),date_breaks = "1 year",
date_labels = "%Y") +
labs(title="Heatmap of shipping interactions",
subtitle=paste0("Top receiving companies from ",
plotfrom, ' to ', plotto)) +
theme_classic() +
theme(axis.line.y=element_blank(), plot.subtitle=element_text(size=rel(0.78)),
plot.title.position="plot",
axis.text.y=element_text(colour="Black",size=5),
plot.title=element_text(size=rel(2.3)))
Analysis of the plot above:
The heatmap above tells us the temporal trading interactions of the top 20 in-deg companies through the years. It is observed that company ‘Mar del Este CJSC’ has ‘hot’ patterns in the most recent years.
3.2.2 Comparing trading patterns of Mar del Este CJSC company and Pao gan SE Seal company with their suppliers
Let us examine the trading patterns of ‘Mar del Este’, the top leading companies in terms of ‘in-deg’ with four out of its many suppliers (as seen from the interactive graph).
First, filter records with only Mar del Este and 4 of its suppliers.
Next, plot a time series using geom_line() and geom_point_interactive().
Show the code
mar_in_df <- mar_in_df %>%
mutate(tooltip = paste0('# of interaction: ', weight, '\nDate :', arrivaldate))
mar_in_df1<- ggplot(mar_in_df %>%
filter(from=='Olas del Mar N.V.'),
aes(x=arrivaldate, y=weight)) +
geom_line( color="steelblue", size = 0.8) +
geom_point_interactive(aes(tooltip= tooltip),
size = 0.5) +
xlab("") +
theme_light() +
theme(#panel.spacing = unit(2, "lines"),
axis.text.x=element_text(angle=60, hjust=1),
panel.grid.major.x = element_blank(),
axis.title.y = element_text(size = 10)) +
scale_x_date(
breaks = seq(as.Date("2028-01-01"), as.Date("2034-12-31"), by = "3 months"),
limits = c(as.Date("2028-01-01"), as.Date("2034-12-31")),
labels = function(x) format(x, "%b %Y")) +
scale_y_continuous(
limits = c(0, 30),
breaks = seq(0, 30, by = 5),
expand = c(0, 0)) +
labs(title='Olas del Mar N.V.',
x = 'Date',
y='Number of trading occurrence')
mar_in_df2<- ggplot(mar_in_df %>% filter(from=='Sea Breezes S.A. de C.V. Freight '), aes(x=arrivaldate, y=weight)) +
geom_line( color="steelblue", size = 0.8) +
geom_point_interactive(aes(tooltip= tooltip),
size = 0.5) +
xlab("") +
theme_light() +
theme(#panel.spacing = unit(2, "lines"),
axis.text.x=element_text(angle=60, hjust=1),
panel.grid.major.x = element_blank(),
axis.title.y = element_text(size = 10)) +
scale_x_date(
breaks = seq(as.Date("2028-01-01"), as.Date("2034-12-31"),
by = "3 months"),
limits = c(as.Date("2028-01-01"), as.Date("2034-12-31")),
labels = function(x) format(x, "%b %Y")) +
scale_y_continuous(
limits = c(0, 30),
breaks = seq(0, 30, by = 5),
expand = c(0, 0)) + # Remove padding around the y-axis limits
labs(title='Sea Breezes S.A. de C.V. Freight ',
x = 'Date',
y='Number of trading occurrence')
mar_in_df3<- ggplot(mar_in_df %>% filter(from=='Blue Horizon Family &'), aes(x=arrivaldate, y=weight)) +
geom_line( color="steelblue", size = 0.8) +
geom_point_interactive(aes(tooltip= tooltip),
size = 0.5) +
xlab("") +
theme_light() +
theme(#panel.spacing = unit(2, "lines"),
axis.text.x=element_text(angle=60, hjust=1),
panel.grid.major.x = element_blank(),
axis.title.y = element_text(size = 10)) +
scale_x_date(
breaks = seq(as.Date("2028-01-01"), as.Date("2034-12-31"), by = "3 months"),
limits = c(as.Date("2028-01-01"), as.Date("2034-12-31")),
labels = function(x) format(x, "%b %Y")) +
scale_y_continuous(
limits = c(0, 30),
breaks = seq(0, 30, by = 5),
expand = c(0, 0)) +
labs(title='Blue Horizon Family &',
x = 'Date',
y='Number of trading occurrence')
mar_in_df4<- ggplot(mar_in_df %>% filter(from=='Wave Watchers Ltd. Liability Co'), aes(x=arrivaldate, y=weight)) +
geom_line( color="steelblue", size = 0.8) +
geom_point_interactive(aes(tooltip= tooltip),
size = 0.5) +
xlab("") +
theme_light() +
theme(#panel.spacing = unit(2, "lines"),
axis.text.x=element_text(angle=60, hjust=1),
panel.grid.major.x = element_blank(),
axis.title.y = element_text(size = 10)) +
scale_x_date(
breaks = seq(as.Date("2028-01-01"), as.Date("2034-12-31"), by = "3 months"),
limits = c(as.Date("2028-01-01"), as.Date("2034-12-31")),
labels = function(x) format(x, "%b %Y")) +
scale_y_continuous(
limits = c(0, 30),
breaks = seq(0, 30, by = 5),
expand = c(0, 0)) +
labs(title='Wave Watchers Ltd. Liability Co',
x = 'Date',
y='Number of trading occurrence')
girafe(code = print(mar_in_df1 /mar_in_df2 /mar_in_df3/mar_in_df4),
#width_svg = 6,
height_svg =8,
options = list(
opts_hover(css = "fill: #202020;"),
opts_hover_inv(css = "opacity:0.2;")
)
) Now let us examine the trading patterns of ‘Pao gan SE Seal’, one of the few top leading companies in terms of ‘in-deg’ with four out of its many suppliers. First filter all records involving Pao gan and its 4 suppliers.
Next, plot a time series using geom_line() and geom_point_interactive().
Show the code
paogan_in_df <- paogan_in_df %>%
mutate(tooltip = paste0('# of interaction: ', weight, '\nDate :', arrivaldate))
paogan_in_df1<- ggplot(paogan_in_df %>%
filter(from=='Paradera S.A. de C.V.'),
aes(x=arrivaldate, y=weight)) +
geom_line( color="steelblue", size = 0.8) +
geom_point_interactive(aes(tooltip= tooltip),
size = 0.5) +
xlab("") +
theme_light() +
theme(#panel.spacing = unit(2, "lines"),
axis.text.x=element_text(angle=60, hjust=1),
panel.grid.major.x = element_blank(),
axis.title.y = element_text(size = 10)) +
scale_x_date(
breaks = seq(as.Date("2028-01-01"), as.Date("2034-12-31"), by = "3 months"),
limits = c(as.Date("2028-01-01"), as.Date("2034-12-31")),
labels = function(x) format(x, "%b %Y")) +
scale_y_continuous(
limits = c(0, 100),
breaks = seq(0, 100, by = 20),
expand = c(0, 0)) +
labs(title='Paradera S.A. de C.V.',
x = 'Date',
y='Number of trading occurrence')
paogan_in_df2<- ggplot(paogan_in_df %>% filter(from=='Greek Octopus SRL Logistics'), aes(x=arrivaldate, y=weight)) +
geom_line( color="steelblue", size = 0.8) +
geom_point_interactive(aes(tooltip= tooltip),
size = 0.5) +
xlab("") +
theme_light() +
theme(#panel.spacing = unit(2, "lines"),
axis.text.x=element_text(angle=60, hjust=1),
panel.grid.major.x = element_blank(),
axis.title.y = element_text(size = 10)) +
scale_x_date(
breaks = seq(as.Date("2028-01-01"), as.Date("2034-12-31"),
by = "3 months"),
limits = c(as.Date("2028-01-01"), as.Date("2034-12-31")),
labels = function(x) format(x, "%b %Y")) +
scale_y_continuous(
limits = c(0, 100), # Set the y-axis limits
breaks = seq(0, 100, by = 20), # Set the y-axis breaks at intervals of 5
#minor_breaks = seq(0, 30, by = 1), # Set the y-axis minor breaks at intervals of 1
expand = c(0, 0)) + # Remove padding around the y-axis limits
labs(title='Greek Octopus SRL Logistics',
x = 'Date',
y='Number of trading occurrence')
paogan_in_df3<- ggplot(paogan_in_df %>% filter(from=='Madhya Pradesh Market LLC'), aes(x=arrivaldate, y=weight)) +
geom_line( color="steelblue", size = 0.8) +
geom_point_interactive(aes(tooltip= tooltip),
size = 0.5) +
xlab("") +
theme_light() +
theme(#panel.spacing = unit(2, "lines"),
axis.text.x=element_text(angle=60, hjust=1),
panel.grid.major.x = element_blank(),
axis.title.y = element_text(size = 10)) +
scale_x_date(
breaks = seq(as.Date("2028-01-01"), as.Date("2034-12-31"), by = "3 months"),
limits = c(as.Date("2028-01-01"), as.Date("2034-12-31")),
labels = function(x) format(x, "%b %Y")) +
scale_y_continuous(
limits = c(0, 100),
breaks = seq(0, 100, by = 20),
expand = c(0, 0)) +
labs(title='Madhya Pradesh Market LLC',
x = 'Date',
y='Number of trading occurrence')
paogan_in_df4<- ggplot(paogan_in_df %>% filter(from=='Sea Breezes S.A. de C.V. Freight '), aes(x=arrivaldate, y=weight)) +
geom_line( color="steelblue", size = 0.8) +
geom_point_interactive(aes(tooltip= tooltip),
size = 0.5) +
xlab("") +
theme_light() +
theme(#panel.spacing = unit(2, "lines"),
axis.text.x=element_text(angle=60, hjust=1),
panel.grid.major.x = element_blank(),
axis.title.y = element_text(size = 10)) +
scale_x_date(
breaks = seq(as.Date("2028-01-01"), as.Date("2034-12-31"), by = "3 months"),
limits = c(as.Date("2028-01-01"), as.Date("2034-12-31")),
labels = function(x) format(x, "%b %Y")) +
scale_y_continuous(
limits = c(0, 100),
breaks = seq(0, 100, by = 20),
expand = c(0, 0)) +
labs(title='Sea Breezes S.A. de C.V. Freight ',
x = 'Date',
y='Number of trading occurrence')
girafe(code = print(paogan_in_df1 /paogan_in_df2 /paogan_in_df3 /paogan_in_df4),
#width_svg = 6,
height_svg =8,
options = list(
opts_hover(css = "fill: #202020;"),
opts_hover_inv(css = "opacity:0.2;")
)
) Analysis of the plot above: The trading interaction of ‘Mar del Este’ with its suppliers is more consistent throughout as compared to ‘Pao gan SE Seal’. Throughout the years, ‘Pao gan SE Seal’ has interactions many suppliers but some of these relationships are short term. . It dawned on me that we can identify IUU companies that frequently close down and re-register their companies from the type of plot above. For such companies, I would expect the buyer’s graph to show that buyer has been changing suppliers very frequently.
For our group project we can consider creating drop down list of ‘buying companies’ for user to interact with and see each of the buyer’s interaction with their top n suppliers over the years. A coordinated view can be created with social network graph such that when a buyer / supplier node is selected on the network graph, its trading activity over time with top n suppliers/ buyers is automatically generated.
3.3 Interactive graph for top 5 OUT-DEGREE CENTRALITY companies
Let us get the top 5 company names in terms of out-deg centrality scores.
Filter all records in the edge file involving the top 5 out-deg companies.
Next calculate the sum of all shipments between each company and each of it buyercompany.
To further reduce the number of nodes, we will keep the records when the ‘total_shipped_weightkg’ is above 10,000,000 kg. This will retain only trading interactions involving a sizable weight (kg) of goods traded. This is an alternative to filtering using the ‘weight’ column which indicates the number of trading interactions. There are 10k edges after this step.
Show the code
# Subset the dataframe to keep only rows with valid values (top 5 out-deg) in 'from' column (15K rows)
mc2_seafood_edges_agg_out <- mc2_seafood_edges_agg[mc2_seafood_edges_agg$source %in% top_n_out_list, ]
# Calculate the sum of all the shipment weight between each pair of companies in edge file.
mc2_seafood_edges_agg_out <- mc2_seafood_edges_agg_out %>%
group_by(source, target) %>%
mutate(total_shipped_weightkg = sum(sum_goods_weightkg))
# further removal of records where 'weight' column is less than 10
mc2_seafood_edges_agg_out <- mc2_seafood_edges_agg_out %>%
filter(total_shipped_weightkg>=10000000)The first code chunk below extracts the ‘nodes’ table from tbl_graph() object ‘seafood_graph’ created earlier. The reason for doing so is because we want the centrality values inside for tooltip later.
Next, trim this node dataframe by including nodes found only in the ‘from’ and ‘target’ columns in ‘mc2_seafood_edges_agg_in’ edge file. There are 24 nodes after this step.
Rename the first two columns of the edge file to from and to for visNetwork to be able to recognise them. Create a title column for visNetwork to display the tooltip when we hover our mouse over the edges later.
Similarly, in the nodes file we add a column title We will be displaying the ‘in-deg’, ‘betweenness’ and ‘out-deg’ scores in the tooltip.
Show the code
nodes_seafood_vis_out <- nodes_seafood_out %>%
#rename(group= rcvcountry) %>%
mutate(pagerank = round(pagerank, 5)) %>%
mutate(title = paste('shpcountry =', shpcountry, ',',
'rcvcountry =', rcvcountry, ',',
'\n In-deg = ',in_deg_centrality, ',',
"\n Betweenness =", betweenness_centrality, ',',
"\n Out-deg =", out_deg_centrality))As we want the nodes to be colour-coded by the out-deg centrality scores, we have first have to bin the out-deg scores into intervals using the cut() function. Next, rename the out_deg_grp column to group column for VisNetwork to colour nodes by out-deg score intervals.
Show the code
out_deg_brks <- c(0,10000, 20000, 30000)
grps <- c('10,000 & Below','10,001-20,001', '20,001-27,570')
nodes_seafood_vis_out$out_deg_grp <- cut(nodes_seafood_vis_out$out_deg_centrality, breaks=out_deg_brks, labels = grps,include.lowest = TRUE)
nodes_seafood_vis_out <- nodes_seafood_vis_out %>%
rename(group = out_deg_grp)The code chunk below plots the interactive graph.
Show the code
visNetwork(nodes_seafood_vis_out,
mc2_seafood_edges_agg_vis_out,
main = "Ego Network of top 5 OUT-DEG companies",
height = "500px", width = "100%") %>%
visIgraphLayout(layout = "layout_nicely") %>%
visEdges(arrows = 'to',
smooth = list(enables = TRUE,
type= 'straightCross'),
shadow = FALSE,
dash = FALSE) %>%
visOptions(highlightNearest = list(enabled = T, degree = 1, hover = T),
nodesIdSelection = TRUE,
selectedBy = "group") %>%
visInteraction(hideEdgesOnDrag = TRUE) %>%
visLegend() %>%
visLayout(randomSeed = 123)The chart highlights actors with high out-deg centrality scores over 2028 to 2034. In this default view, we can quickly identify the 2 coloured nodes as top 2 suppliers. Both top out-deg companies supplies to 3 common target companies; they are “Coasta de la Felicidad’, ‘Niger Bend Limited Liability..’ and”Pao gan SE Seal’.
The table below shows the names of the top 5 out-deg companies
Show the code
| id | out_deg_centrality | in_deg_centrality | pagerank |
|---|---|---|---|
| nián yú Ltd. Corporation | 27570 | 2 | 8.55e-05 |
| Sea Breezes S.A. de C.V. Freight | 24283 | 10 | 5.29e-05 |
| Estrella de la Costa SRL | 8862 | 7 | 8.12e-05 |
| Blue Horizon Family & | 7670 | 0 | 4.16e-05 |
| Madhya Pradesh Market LLC | 6789 | 1 | 4.27e-05 |
3.3.1 Temporal analysis of top 20 ‘out-deg’ centrality companies in trading occurrence over the years
First, obtain the list of company ids in the top 20 in-deg ranking.
Next, filter the edge file to keep only the records involving the top 20 out-deg companies. There are 43K records after this step
Use the first code chunk below to compute the sum of monthly weights (interactions) between each unique pair of company in our filtered data. The second code chunk is to compute the year-month in which a source company has the max number of shipping interaction throughout the years.
Show the code
hm <- mc2_seafood_edges_agg_out_hm %>%
mutate(month = floor_date(arrivaldate, unit = "month")) %>%
group_by(source, month) %>%
summarise(total_weight_by_month = sum(weight)) %>%
ungroup()
hm <- hm %>% group_by(source) %>%
mutate(maxweightmonth = as.Date(month[which.max(total_weight_by_month)])) %>%
ungroup()We will now plot the heatmap of top 20 out-deg companies to show their shipping interactions over the years.
Show the code
plotfrom <- "2028-01-02"
plotto <- '2034-01-02'
ggplot(hm, aes(x = month, y = fct_reorder(source, maxweightmonth), fill = total_weight_by_month)) +
geom_tile(colour="White", show.legend=FALSE) +
scale_fill_distiller(palette="Spectral") +
scale_y_discrete(name="", expand=c(0,0))+
scale_x_date(name="Arrival Date",
limits=as.Date(c(plotfrom, plotto)),
expand=c(0,0),date_breaks = "1 year",
date_labels = "%Y") +
labs(title="Heatmap of shipping interactions",
subtitle=paste0("Top supplier companies from ",
plotfrom, ' to ', plotto)) +
theme_classic() +
theme(axis.line.y=element_blank(), plot.subtitle=element_text(size=rel(0.78)),
plot.title.position="plot",
axis.text.y=element_text(colour="Black",size=5),
plot.title=element_text(size=rel(2.3)))
Analysis of the plot above:
At a quick glance, three companies are hot, they are ‘nian yu Ltd Corporation’ ,‘Playa del Tesoro OJSC’. and ‘Sea Breezes SA’.
‘nian yu Ltd Corporation’ had several instances of very high shipping activities in 2031 and 2032, which is worth looking into.
‘Madhya Pradesh Market LLC’ had interactions only from 2029 to 2032 first quarter. What happened to it?
3.3.2 Treemap of business relationship between shipping and receiving companies
Finally, lets build a treemap to get a high level view of the companies that these main out-deg companies ship the goods to.
The code chunk below groups the data by from and to columns and aggregating the TotalInteractions and MedianCargoWeight_daily between each pair shipping and receiving company by hscode.
The code chunk below plots the treemap using the treemap library.
Show the code

The most outer layer refers to shipping companies while the tiles within represents the companies that they shipped their goods to. The bigger the tile size, the more the interaction. The darker the colour, the greater the total shipment weight accumulated across the years.
The largest player here is ‘nian yu Ltd Corporation’ .
‘Playa del Tesoro OJSC’ is the ranked fifth in terms of out-deg centrailty and it ships in high frequency and large volume to its partners over the years 2028 to 2034.
The interactive version created with d3treeR library. But since this is not available in CRAN, we might not be able to use this in Shiny.
3.4 Correlationship between partnership intervals (in days) and total number of interactions between company pairs
In this section, lets examine whether there is a correlationship between the interval of interaction within a year and the number of interactions between each pairs of companies. Will there be companies with short partnership duration and high number of interactions (e.g. shipping in high frequency for only 1 week within a year) ?
We will prepare the dataframe needed for the plot.
First, for each group of from, to, year:
1) partnership_days : the number of days within the year that each pair of companies had interactions
2) total_interaction : sum of all the Weights between each pair of companies
3.4.1 Checking for statistical significance in correlationship
Show the code
library(scales)
correl_2028 <- ggscatterstats(data = cor %>% filter(year==2028),
x = partnership_days, y = total_interaction,
type = "nonparametric") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
scale_x_continuous(limits = c(0, 365),
breaks=seq(0, 365, 30),
labels= comma) +
scale_y_continuous(limits = c(0, 1250),
breaks=seq(0, 1250, 250),
labels= comma) +
labs(title = "Number of interactions and partnership duration in 2028",
x = "Partnership days", y = "Total interactions")
correl_2029 <- ggscatterstats(data = cor %>% filter(year==2029),
x = partnership_days, y = total_interaction,
type = "nonparametric") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
scale_x_continuous(limits = c(0, 365),
breaks=seq(0, 365, 30),
labels= comma) +
scale_y_continuous(limits = c(0, 1250),
breaks=seq(0, 1250, 250),
labels= comma) +
labs(title = "Number of interactions and partnership duration in 2029",
x = "Partnership days", y = "Total interactions")
correl_2030 <- ggscatterstats(data = cor %>% filter(year==2030),
x = partnership_days, y = total_interaction,
type = "nonparametric") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
scale_x_continuous(limits = c(0, 365),
breaks=seq(0, 365, 30),
labels= comma) +
scale_y_continuous(limits = c(0, 1250),
breaks=seq(0, 1250, 250),
labels= comma) +
labs(title = "Number of interactions and partnership duration in 2030",
x = "Partnership days", y = "Total interactions")
correl_2031 <- ggscatterstats(data = cor %>% filter(year==2031),
x = partnership_days, y = total_interaction,
type = "nonparametric") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
scale_x_continuous(limits = c(0, 365),
breaks=seq(0, 365, 30),
labels= comma) +
scale_y_continuous(limits = c(0, 1250),
breaks=seq(0, 1250, 250),
labels= comma) +
labs(title = "Number of interactions and partnership duration in 2031",
x = "Partnership days", y = "Total interactions")
correl_2032 <- ggscatterstats(data = cor %>% filter(year==2032),
x = partnership_days, y = total_interaction,
type = "nonparametric") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
scale_x_continuous(limits = c(0, 365),
breaks=seq(0, 365, 30),
labels= comma) +
scale_y_continuous(limits = c(0, 1250),
breaks=seq(0, 1250, 250),
labels= comma) +
labs(title = "Number of interactions and partnership duration in 2032",
x = "Partnership days", y = "Total interactions")
correl_2033 <- ggscatterstats(data = cor %>% filter(year==2033),
x = partnership_days, y = total_interaction,
type = "nonparametric") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
scale_x_continuous(limits = c(0, 365),
breaks=seq(0, 365, 30),
labels= comma) +
scale_y_continuous(limits = c(0, 1250),
breaks=seq(0, 1250, 250),
labels= comma) +
labs(title = "Number of interactions and partnership duration in 2033",
x = "Partnership days", y = "Total interactions")
correl_2034 <- ggscatterstats(data = cor %>% filter(year==2034),
x = partnership_days, y = total_interaction,
type = "nonparametric") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
scale_x_continuous(limits = c(0, 365),
breaks=seq(0, 365, 30),
labels= comma) +
scale_y_continuous(limits = c(0, 1250),
breaks=seq(0, 1250, 250),
labels= comma) +
labs(title = "Number of interactions and partnership duration in 2034",
x = "Partnership days", y = "Total interactions")
# combining plots using patchwork
p_correl <- (correl_2034 + correl_2033) / (correl_2032 + correl_2031) / (correl_2030 + correl_2029) # + plot_spacer() + plot_spacer()
p_correl + plot_annotation(title = "Correlation between Number of interactions and partnership duration (days)",
theme = theme(plot.title = element_text(size = 18),
plot.subtitle = element_text(size = 12))) + plot_layout(ncol = 1, nrow = 3,
heights = c(2,2))
The plots (non-parametric) above has p-values less than 0.05 and it suggests that there is a correlationship between the rank-transformed data of total interactions and partnership duration between companies . The upper outliers pairs of companies could be worth investigating because they have high number of interactions for a particular partnership duration in a year. For example, if a company A had interaction with company B for only 3 months but with exceptionally high number of trading interactions, should both companies be worth investigating?
3.4.2 Coordinated and interactive scatterplot
For usability , lets us plot an interactive and coordinate equivalent plot of the above scatterplot chart for the most recent four years 2034, 2033, 2032 and 2031.
First we will prepare the tooltip to be shown.
Next, we will filter and use geom_point_interactive() of ggiraph library to plot.
Show the code
scatter_2034 <- ggplot(data=cor%>% filter(year==2034),
aes(x=partnership_days, y=total_interaction)) +
geom_point_interactive(aes(data_id = label1,
tooltip= tooltip),
size = 0.8) +
geom_smooth(method='lm',
size = 0.5) +
scale_x_continuous(limits = c(0, 365), breaks = seq(0, 365, 30)) +
scale_y_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 200)) +
labs(title = "Scatterplot of # of interactions and Partnership duration in 2034",
y = "Total interactions",
x = "Partnership duration") +
theme_minimal() +
theme(plot.title = element_text(size = 12, face ='bold'),
axis.title = element_text(size = 8, face= 'bold'),
axis.text.x = element_text(angle = 45, hjust = 1),
axis.text = element_text(size = 8))
scatter_2033 <- ggplot(data=cor%>% filter(year==2033),
aes(x=partnership_days, y=total_interaction)) +
geom_point_interactive(aes(data_id = label1,
tooltip= tooltip),
size = 0.8) +
geom_smooth(method='lm',
size = 0.5) +
scale_x_continuous(limits = c(0, 365), breaks = seq(0, 365, 30)) +
scale_y_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 200)) +
labs(title = "Scatterplot of # of interactions and Partnership duration in 2033",
y = "Total interactions",
x = "Partnership duration") +
theme_minimal() +
theme(plot.title = element_text(size = 12,face = 'bold'),
axis.title = element_text(size = 8, face= 'bold'),
axis.text.x = element_text(angle = 45, hjust = 1),
axis.text = element_text(size = 8))
scatter_2032 <- ggplot(data=cor%>% filter(year==2032),
aes(x=partnership_days, y=total_interaction)) +
geom_point_interactive(aes(data_id = label1,
tooltip= tooltip),
size = 0.8) +
geom_smooth(method='lm',
size = 0.5) +
scale_x_continuous(limits = c(0, 365), breaks = seq(0, 365, 30)) +
scale_y_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 200)) +
labs(title = "Scatterplot of # of interactions and Partnership duration in 2032",
y = "Total interactions",
x = "Partnership duration") +
theme_minimal() +
theme(plot.title = element_text(size = 12,face = 'bold'),
axis.title = element_text(size = 8, face= 'bold'),
axis.text.x = element_text(angle = 45, hjust = 1),
axis.text = element_text(size = 8))
scatter_2031 <- ggplot(data=cor%>% filter(year==2031),
aes(x=partnership_days, y=total_interaction)) +
geom_point_interactive(aes(data_id = label1,
tooltip= tooltip),
size = 0.8) +
geom_smooth(method='lm',
size = 0.5) +
scale_x_continuous(limits = c(0, 365), breaks = seq(0, 365, 30)) +
scale_y_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 200)) +
labs(title = "Scatterplot of # of interactions and Partnership duration in 2031",
y = "Total interactions",
x = "Partnership duration") +
theme_minimal() +
theme(plot.title = element_text(size = 12,face = 'bold'),
axis.title = element_text(size = 8, face= 'bold'),
axis.text.x = element_text(angle = 45, hjust = 1),
axis.text = element_text(size = 8))
girafe(code = print(scatter_2034 / scatter_2033 /scatter_2032 /scatter_2031),
width_svg = 6,
height_svg = 7,
options = list(
opts_hover(css = "fill: #202020;"),
opts_hover_inv(css = "opacity:0.2;")
)
) Features of the chart above
Each point in chart represents a pair of companies that ships and receives, hover mouse over to see ‘company pair names’, ‘total interaction’ and ‘partnership days’ inside tooltip
If a particular company pair exists in more than one chart , its corresponding point will also be highlighted in the other charts.
If two companies consistently have similar number of interactions within similar partnership days across the years, then there is no issue. Explore the interactive chart to spot pairs of companies with ‘solo’ occurrence throughout the years and is also one of the outlier pairs. Those pair of companies could be worth investigating.
3.5 Comparison of centrality values across shipping countries
In this section, we will compare the centrality values of companies from different countries.
3.5.1 Comparison of out-deg centrality between top shipping countries
Let us compute top 5 shipping countries in terms of out-deg centrality scores.
Show the code
| shpcountry | sum_out_deg |
|---|---|
| Merigrad | 34396 |
| Isliandor | 24283 |
| Mawazam | 8891 |
| Osterivaria | 7670 |
| Marebak | 22 |
The code chunk below filters the top 5 shipping countries and non parametric one way anova test is performed to compare for significance in difference in median of number of interactions between them.
Show the code
ggbetweenstats(data = nodes_seafood_vis_out %>%
filter(shpcountry %in% c("Isliandor", "Osterivaria", "Vesperanda", "Mawazam", "Merigrad")),
x = shpcountry,
y = out_deg_centrality,
xlab = "Shipping Country", ylab = "Out-Deg centrality",
type = "np", pairwise.comparisons = TRUE, pairwise.display = "s",
sort = "descending",
sort.fun = median,
mean.ci = T, p.adjust.method = "fdr", conf.level = 0.95,
title = "Comparison of Median Out-Deg centrality across shipping Countries") +
scale_y_continuous(limits = c(0, 4000)) +
theme(axis.title.y=element_text(angle = 0,
vjust=0.9))
The P - value is above 0.05, so there is not enough statistical evidence to reject the null hypothesis that the median out-deg centrality scores between top 5 shipping countries are different.
3.5.2 Comparison of in-deg centrality between top receiving countries
Let us compute top 5 receiving countries in terms of in-deg centrality scores.
Show the code
# A tibble: 5 × 2
rcvcountry sum_in_deg
<chr> <dbl>
1 Oceanus 240470
2 Coralmarica 33
3 Marebak 4
4 Merigrad 1
5 Faraluna 0
There are only 3 major receiving countries in our filtered dataset .
Show the code
ggbetweenstats(data = nodes_seafood_vis_in %>% filter(rcvcountry %in% c("Jiraputra", "Oceanus",'Coralmarica','Marebak','Mawazam')), x = rcvcountry, y = in_deg_centrality,
xlab = "Shipping Country", ylab = "In-Deg centrality",
type = "np", pairwise.comparisons = TRUE, pairwise.display = "s",
sort = "descending",
sort.fun = median,
mean.ci = T, p.adjust.method = "fdr", conf.level = 0.95,
title = "Comparison of Median In-Deg centrality across receiving Countries") +
scale_y_continuous(limits = c(0, 150)) +
theme(axis.title.y=element_text(angle = 0,
vjust=0.9))
The P - value is above 0.05, so there is not enough statistical evidence to reject the null hypothesis that the median in-deg centrality scores between top 5 receiving countries are different.
3.6 Conclusion
In this exercise, we have identified the key centrality actors in terms of betweenness, in-degree and out-degree. My key findings are such:
Companies high in betweenness scores ‘buy’ and ‘sell’, although they tend to focus on one of the two activities. However, this companies usually do not have the highest in and out degree scores. We can derive simple business relationships as seen in section 3.1 and to get more relationship, we can simply extend the network graph.
Companies with high in-degree scores tend to focus heavily on buying only and companies with highh out-degree scores tend to focus on selling only. Companies with high in/out centrality scores can share similar suppliers/ buyers. A treemap (section 3.3.3) will be able to help us see such relationship.
Heatmaps allows us to see shipping activities for many companies at one go. We can look out for companies who have spikes in their selling and buying patterns to fish out illegal activities.
Multiple line plots (section 3.2.2) can show us the shipping patterns between a target company and some of its source companies over time. From target companies who keeps changing source companies, we could try to investigate them to check if they are dealing with companies who kept de-registering after being caught for IUU and re-registering to continue the business.
My key takeaway from this exercise:
- I filtered records too early, even though some edge weights are small, but they could provide the link to finding high betweenness nodes. By chopping away these edges to early, I was unable to get any betweenness scores for the remaining network in my filtered dataset.
4 References
Hohenfeld, F. (2021, August 12). Graphs Are Fun: An Introduction to Graphs in R. Hohenfeld.is. Retrieved from https://hohenfeld.is/posts/graphs-are-fun-an-introduction-to-graphs-in-r/
R Core Team. (2021, September 13). ggraph: A Grammar of Graphics for Graphs and Networks. The Comprehensive R Archive Network (CRAN). Retrieved from https://cran.r-project.org/web/packages/ggraph/vignettes/Edges.html
Datastorm-Open. (n.d.). visNetwork. GitHub Pages. Retrieved from http://datastorm-open.github.io/visNetwork/
Rivasiker, G. (n.d.). ggHoriPlot: Interactive Horizon Plot for R. GitHub Pages. Retrieved from https://rivasiker.github.io/ggHoriPlot/
Designing Interactive Treemap Using D3TreeR.” (n.d.). Retrieved from https://r4va.netlify.app/chap16.html#designing-interactive-treemap-using-d3treer
Public Health England. (2020). Visualising the spread of COVID-19 in England [PDF document]. Retrieved from https://extra.shu.ac.uk/ppp-online/wp-content/uploads/2020/11/visualising-spread-COVID19-england.pdf